ReCSAI: Recursive compressed sensing artificial intelligence for confocal lifetime localization microscopy

Localization-based super-resolution microscopy resolves macromolecular structures down to a few nanometers by computationally reconstructing fluorescent emitter coordinates from diffraction-limited spots. The most commonly used algorithms are based on fitting parametric models of the point spread function (PSF) to a measured photon distribution. These algorithms make assumptions about the symmetry of the PSF and thus do not work well with irregular, nonlinear PSFs such as those that occur in confocal lifetime imaging, where a laser is scanned across the sample. An alternative method for reconstructing sparse emitter sets from noisy, diffraction-limited images is compressed sensing, but due to its high computational cost it has not yet been widely adopted. Deep neural network fitters have recently emerged as a competitive new method for localization microscopy. They can learn to fit arbitrary PSFs, but require extensive simulated training data and do not generalize well. A method that efficiently fits the irregular PSFs of confocal lifetime localization microscopy by combining the advantages of deep learning and compressed sensing would greatly improve the acquisition speed and throughput of this technique. Here we introduce


Introduction
The resolution of classical fluorescence microscopy is limited by the Abbe criterion (1). In the past decades, several super-resolution techniques that surpass this limit have been developed. One of them is single molecule localization microscopy (SMLM), which is based on localizing the positions of individual fluorescent dyes. By fitting a model of the theoretical photon distribution, the point spread function (PSF), to the measured signal, the emitter position can be determined precisely (2). While this problem was quickly solved for ideal samples, real data are often more challenging: overlapping or varying photon distributions as well as a low signal-to-noise ratio still pose difficulties. Various approaches are used to reconstruct super-resolved positions of individual emitters, such as intensity centroids, fitting Gaussian or more complex (e.g., Zernike polynomial) functions (3), compressed sensing (4,5), and deep neural networks (6,7). An interesting application of SMLM is the simultaneous imaging of different targets using multiple colors; however, the choice of suitable fluorescent dyes is very limited, and chromatic aberrations are unavoidable for different emission wavelengths (8). A promising workaround is to distinguish dyes with similar emission wavelengths by their different fluorescence lifetimes (9). This detection method is based on confocal scanning, where a laser scans the sample at a sampling rate comparable to the switching rate of the fluorescent dyes. This introduces distorted and disrupted PSFs that cannot be properly localized by fitting a parametric PSF model. To solve this problem, Thiele et al. (9) acquired multiple frames, projected them onto each other to obtain complete PSFs, and applied conventional fitting. An efficient method to fit the irregular, chopped PSFs in individual frames would greatly improve the acquisition speed and throughput of confocal lifetime localization microscopy. The nonlinearities of irregular, chopped PSFs require a large degree of flexibility in the fitting function while maintaining high precision. Most classical algorithms are based on fitting a Gaussian or similar function to the measured photon distribution of each individual emitter. While these methods approach the theoretical lower bound of localization precision for individual emitters, they fail when emitters overlap or the PSF is irregular.
For overlapping emitters at high density, compressed sensing (CS) is superior to conventional fitting (10). CS works by solving the inverse problem of recovering a super-resolved image of the emitters from a noisy, low-resolution measurement, using sparsity in the spatial (4,5) or correlation (11) domain as a constraint. Due to its high computational demands, CS has so far not found widespread use. Artificial neural networks (ANNs) are well suited for fitting complex PSFs, as they are essentially high-dimensional function approximators. Recently, ANN-based fitters such as DeepSTORM (6) and DECODE (7) achieved outstanding results in an SMLM reconstruction benchmark (10), outperforming CS at high emitter density. Since the iterative process of compressed sensing can be expressed as a differentiable operation, it should be possible to integrate it into a neural network and combine the advantages of both approaches. Here, we present a novel trainable fitting algorithm that combines wavelet denoising, compressed sensing, and deep learning to recover emitter locations from confocal dSTORM data with chopped, irregular PSFs. We developed a simulator that produces accurate ground truth for training, including various distortions and noise as well as temporal context over multiple frames. We then trained different neural networks that combine existing approaches with a novel trainable CS layer and wavelet filters, and evaluated and compared their performance on simulated and experimental data using common accuracy metrics as well as Fourier Ring Correlation (12) and LineProfiler (13).

Fluorescence Lifetime Imaging Microscopy (FLIM).
All single molecule fluorescence lifetime measurements were performed on a MicroTime200 (PicoQuant, Berlin, Germany) time-resolved confocal fluorescence microscope setup consisting of a FLIMbee galvo scanner (PicoQuant, Berlin, Germany), an Olympus IX83 microscope with an oil-immersion objective (60×, NA 1.45; Olympus), two single photon avalanche photodiodes (SPADs) (Excelitas Technologies, 75154 K3, 75154 L6) and a TimeHarp300 dual channel board. Pulsed excitation was performed using a white-light laser (NKT photonics, superK extreme), which was coupled into the MicroTime200 system via a glass fiber (NKT photonics, SuperK FD PM, A502-010-110). A 100 µm pinhole was used for all measurements. The emission light was split onto the SPADs using a 50:50 beamsplitter (PicoQuant, Berlin, Germany). To filter out afterglow effects of the SPADs as well as scattered and reflected light, two identical bandpass filters (ET700/75 M, Semrock, 294808) were installed in front of the SPADs. The measurements were performed and analyzed with the SymPhoTime64 software (PicoQuant, Berlin, Germany). The microtubule measurements were performed with an irradiation intensity of 5 kW cm⁻² in T3 mode with 25 ps time resolution. The pixel dwell time was 100 µs, and a monodirectional line frequency of 108.7 Hz was used. The corresponding frame frequency was 2.4 Hz. Measurements were performed in PBS-based photoswitching buffer containing 100 mM β-mercaptoethylamine (MEA, Sigma-Aldrich) adjusted to pH 7.6.
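As a quick consistency check of the acquisition timing, the quoted line and frame frequencies can be converted into periods. The sketch below uses only the numbers stated in this section; reading the result as "about 45 lines per frame" is our interpretation, which happens to match the 45 × 45 px frames used for the inference benchmark in Table 1.

```python
# Convert the scan settings quoted in the text into periods (plain arithmetic).
LINE_FREQ_HZ = 108.7    # monodirectional line frequency
FRAME_FREQ_HZ = 2.4     # frame frequency

line_period_ms = 1e3 / LINE_FREQ_HZ            # time per scanned line
frame_period_ms = 1e3 / FRAME_FREQ_HZ          # time per frame
lines_per_frame = LINE_FREQ_HZ / FRAME_FREQ_HZ # scanned lines per frame

print(f"line period   {line_period_ms:.2f} ms")   # 9.20 ms
print(f"frame period  {frame_period_ms:.1f} ms")  # 416.7 ms
print(f"lines/frame   {lines_per_frame:.1f}")     # 45.3
```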
Simulation. The core of our simulation is an artificial space S of size N × N, with N = s_px · s_im, where s_px denotes the pixel size (in nm) and s_im the image size in pixels. S is thus a sub-lattice of the image I with nanometer resolution. It represents one frame with only a small number of emitters in the fluorescent ON state. These emitters are simulated with the following properties: a spatial position in x- and y-direction, L_x, L_y ∈ [0, N]; a lifetime distribution (Poisson, t = 90); a switch-on countdown (Poisson, t = 90); and a photon count (ph = randint(800, 1500), Gaussian distribution with σ = 0.2 ph). We define an initially empty set L_ON describing the emitters in a fluorescent ON state. L_ON is recalculated for each line, adding emitters according to a Poisson distribution P_λ(k) = λ^k e^(−λ) / k! and deleting those that have returned to a dark state. We further divide S into horizontal lines of height s_px, representing the rasterization process of the detector (blue line in Figure 1b). A time variable t is increased by ∆t = 6 ms with each horizontal line. The time step of column-wise movements is neglected in our simulations, since it is one order of magnitude shorter. If the switch-on countdown of a localisation is larger than zero, it is decreased by ∆t and the localisation is not rendered in the current line. Once the countdown drops below zero, the localisation is rendered and ∆t is subtracted from its remaining lifetime. Localisations that exceed their lifetime are not rendered in subsequent lines and are deleted from L_ON. The rendering process adds a crop of the PSF, corresponding to the localisation's relative x- and y-position, to S.
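The per-line bookkeeping of L_ON described above can be sketched as a small state machine. This is an illustration under our own assumptions: the Poisson parameters t = 90 are read as mean times in ms, the photon count is drawn as a Gaussian around a uniformly drawn nominal value, and the arrival rate `rate` of new emitters per line is a hypothetical parameter that the text does not specify. Rendering of the PSF itself is left out.

```python
import math
import random
from dataclasses import dataclass

DT_MS = 6.0  # time per scanned line, as in the text

@dataclass
class Emitter:
    x: float              # position on the sub-lattice S (nm)
    y: float
    countdown: float      # ms until the emitter switches ON
    lifetime: float       # remaining ON time (ms)
    photons: float
    lines_rendered: int = 0

def poisson(lam, rng):
    """Knuth's Poisson sampler (simple, O(lam) work per sample)."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def new_emitter(rng, size_nm, mean_delay_ms=90.0, mean_on_ms=90.0):
    ph = rng.randint(800, 1500)                     # nominal photon count
    return Emitter(
        x=rng.uniform(0.0, size_nm),
        y=rng.uniform(0.0, size_nm),
        countdown=float(poisson(mean_delay_ms, rng)),
        lifetime=float(poisson(mean_on_ms, rng)),
        photons=max(0.0, rng.gauss(ph, 0.2 * ph)),  # Gaussian spread, sigma = 0.2 ph
    )

def step_line(l_on, rng, rate, size_nm):
    """Advance by one scanned line: add new emitters, update countdowns and lifetimes."""
    for _ in range(poisson(rate, rng)):
        l_on.append(new_emitter(rng, size_nm))
    rendered, kept = [], []
    for e in l_on:
        if e.countdown > 0:          # still dark: count down, do not render
            e.countdown -= DT_MS
            kept.append(e)
            continue
        rendered.append(e)           # ON during this line: render it
        e.lines_rendered += 1
        e.lifetime -= DT_MS
        if e.lifetime > 0:           # delete it once its lifetime is used up
            kept.append(e)
    return rendered, kept

# Usage: one deterministic emitter, no new arrivals (rate = 0).
rng = random.Random(0)
e = Emitter(0.0, 0.0, countdown=10.0, lifetime=20.0, photons=1000.0)
l_on, on_lines = [e], []
for line in range(8):
    rendered, l_on = step_line(l_on, rng, rate=0.0, size_nm=900.0)
    if any(r is e for r in rendered):
        on_lines.append(line)
```

With a 10 ms countdown and 20 ms lifetime, the emitter stays dark for two lines and is then rendered in lines 2 to 5.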
The PSF crop spans one line width in the y-dimension and is multiplied by the photon count of the emitter. A localisation has to be rendered for at least 40% of its ON-time to be accepted as a true positive. For the PSF shape, we use an Airy disc model from astropy (14) with a varying radius r ∈ [525, 555]. However, this kernel can easily be replaced by a measured PSF for further applications. As a result of the line-wise rendering, the simulated emitters are incomplete at the top or bottom. This reflects switching into the fluorescent ON or OFF state during acquisition and is a typical feature of FLIMbee measurements. Subsequently, the image is resized to s_im pixels using OpenCV's (15) INTER_AREA interpolation. Noise is added according to (16). For training, we simulated 9 × 9 × 3 crops of SMLM data, each containing n ∈ [0, 10] localisations.
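The line-wise ("chopped") rendering and the 40% acceptance rule can be sketched as follows. To stay dependency-light, this sketch swaps the astropy Airy kernel for a normalized Gaussian stand-in (an assumption, not the PSF model used here) and approximates INTER_AREA resizing by block averaging, which is essentially what area interpolation does for integer downscaling factors. All function names and grid sizes are hypothetical.

```python
import numpy as np

def gaussian_psf(size, cx, cy, sigma):
    """Normalized Gaussian stand-in for the Airy kernel on a size x size sub-lattice."""
    yy, xx = np.mgrid[0:size, 0:size]
    g = np.exp(-((xx - cx) ** 2 + (yy - cy) ** 2) / (2.0 * sigma ** 2))
    return g / g.sum()

def render_line(S, line_idx, line_height, cx, cy, sigma, photons):
    """Add only the PSF rows that fall into the currently scanned line (chopped PSF)."""
    rows = slice(line_idx * line_height, (line_idx + 1) * line_height)
    S[rows, :] += photons * gaussian_psf(S.shape[0], cx, cy, sigma)[rows, :]

def area_downsample(S, factor):
    """Block averaging, used here as a stand-in for INTER_AREA with an integer factor."""
    h, w = S.shape
    return S.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def is_true_positive(lines_rendered, lines_on, min_fraction=0.4):
    """Acceptance rule from the text: rendered for at least 40% of the ON-time."""
    return lines_rendered >= min_fraction * lines_on

# Usage: a 20 x 20 sub-lattice scanned in 5 lines of height 4.
S = np.zeros((20, 20))
for line in (1, 2):                     # emitter is ON only while lines 1 and 2 are scanned
    render_line(S, line, 4, cx=10.0, cy=10.0, sigma=3.0, photons=1000.0)
image = area_downsample(S, 4)           # 5 x 5 "detector" image with a truncated spot
```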
Reconstruction. Our reconstruction pipeline is composed of several steps. First, regions of interest are detected by a trainable wavelet-based peak detection layer and cropped to a 9 × 9 × 3 patch around the detected maximum, taking the temporal context of the previous and subsequent frame into account. The selected crops are further processed by one of the network architectures described in the following sections, ultimately creating a feature space that describes the predicted emitters. This feature space has the same spatial size as the input crop and contains a stack of maps estimating the positions ∆x and ∆y relative to the pixel center, the emitter intensity N, the corresponding uncertainties σ_x, σ_y, σ_N, the probability p for a pixel to contain an emitter, and an estimate B of the local background. This output format as well as the loss function were adapted from DECODE (7). If an emitter is close to the edge of a pixel, i.e., ∆x or ∆y is close to ±0.5, the corresponding probability is often distributed over two adjacent pixels. Therefore, we defined the following conditions to retrieve a localisation: if a classifier pixel value exceeds the given threshold, a cross-shaped filter is applied. If the filtered value exceeds 0.7, the pixel with the highest value within the cross is accepted as a localisation. If it exceeds 1.4, the pixel with the second highest value is also accepted.
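The decoding rule above can be sketched as follows. The exact footprint of the cross-shaped filter and the value of the initial classifier threshold are not given in the text, so the 3 × 3 plus-shaped footprint, the default threshold of 0.3 and the tie-breaking by pixel position are our assumptions; the 0.7 and 1.4 levels are taken from the text.

```python
import numpy as np

# Assumed footprint of the cross-shaped filter: centre plus 4-neighbourhood.
CROSS = ((0, 0), (-1, 0), (1, 0), (0, -1), (0, 1))

def decode_localisations(p, dx, dy, threshold=0.3):
    """Turn probability/offset maps (H x W) into a sorted list of (x, y) in pixel units."""
    h, w = p.shape
    picked = set()
    rows, cols = np.nonzero(p > threshold)
    for i, j in zip(rows.tolist(), cols.tolist()):
        cells = [(i + a, j + b) for a, b in CROSS if 0 <= i + a < h and 0 <= j + b < w]
        # Rank by probability; ties are broken by position so that
        # overlapping crosses always select the same pixel.
        cells.sort(key=lambda c: (p[c], c), reverse=True)
        response = sum(p[c] for c in cells)   # cross-filter output at (i, j)
        if response > 0.7:
            picked.add(cells[0])              # best pixel of the formation
        if response > 1.4:
            picked.add(cells[1])              # probably a second emitter
    return sorted((j + float(dx[i, j]), i + float(dy[i, j])) for i, j in picked)

# Usage: one emitter whose probability is split over two neighbouring pixels.
p = np.zeros((5, 5)); p[1, 1], p[1, 2] = 0.5, 0.4
dx = np.zeros((5, 5)); dx[1, 1] = 0.45
dy = np.zeros((5, 5))
locs = decode_localisations(p, dx, dy)        # a single localisation near x = 1.45
```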

Trainable wavelet layer.
To reduce the dimensionality of the reconstruction problem, we developed a preselection (binning) step that crops localizations to ROIs of 9 × 9 pixels. These ROIs are identified with a trainable wavelet filter bank. To ensure perfect reconstruction, the decomposition and reconstruction filters share the same weights. Orthogonality of the filter bank is enforced by coupling the learning process of the low-pass (lp) and high-pass (hp) filter banks through a Gram-Schmidt process. A bias followed by a ReLU activation is applied to the decomposed frequency images. A filtered image can then be reconstructed from the decomposed frequency images using the inverted filter bank. Given noise- and disruption-free ground truth as training data, the algorithm learns to retain spatial frequency components that are "PSF-like". Potential localisations can then be identified by local maximum detection. The denoised data can be of additional use for difficult reconstructions, e.g., when PSFs are shifted or disrupted as in FLIMbee measurements.
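The weight sharing and Gram-Schmidt coupling can be illustrated with the simplest orthogonal filter bank: two learnable 2-tap filters applied to non-overlapping 2 × 2 blocks (a generalized Haar transform; with lp ∝ [1, 1] it reduces to Haar). This is a sketch under our own assumptions: the filter length, the single decomposition level and the symmetric soft threshold used in place of "bias followed by ReLU" are not taken from the text. Because the Gram-Schmidt step makes the 2 × 2 analysis matrix orthonormal, its transpose inverts it, so the same weights serve for decomposition and reconstruction.

```python
import numpy as np

def orthonormal_pair(lp_raw, hp_raw):
    """Gram-Schmidt on two learnable 2-tap filters -> 2x2 orthonormal analysis matrix."""
    lp = lp_raw / np.linalg.norm(lp_raw)
    hp = hp_raw - np.dot(hp_raw, lp) * lp      # remove the low-pass component
    hp = hp / np.linalg.norm(hp)
    return np.stack([lp, hp])                  # rows: low-pass, high-pass

def decompose(img, W):
    """One-level separable 2D transform on non-overlapping 2x2 blocks.
    C[p, q] is a sub-band (p: rows, q: columns; 0 = low, 1 = high)."""
    h, w = img.shape
    blocks = img.reshape(h // 2, 2, w // 2, 2) # blocks[a, i, b, j] = img[2a+i, 2b+j]
    return np.einsum("pi,qj,aibj->pqab", W, W, blocks)

def reconstruct(C, W):
    """Inverse transform with the same (shared) weights; valid because W is orthonormal."""
    blocks = np.einsum("pi,qj,pqab->aibj", W, W, C)
    a, _, b, _ = blocks.shape
    return blocks.reshape(2 * a, 2 * b)

def shrink_details(C, bias):
    """Soft-threshold the three detail bands (symmetric stand-in for bias + ReLU)."""
    out = C.copy()
    detail = np.ones((2, 2), dtype=bool)
    detail[0, 0] = False                       # keep the low-low (approximation) band
    out[detail] = np.sign(C[detail]) * np.maximum(np.abs(C[detail]) - bias, 0.0)
    return out

# Usage with random (untrained) filter weights.
rng = np.random.default_rng(0)
W = orthonormal_pair(rng.normal(size=2), rng.normal(size=2))
img = rng.poisson(5.0, size=(8, 8)).astype(float)
C = decompose(img, W)
denoised = reconstruct(shrink_details(C, bias=1.0), W)
```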

Trainable CS layer.
A major challenge of CS algorithms is the choice of suitable hyperparameters. We implemented the fast iterative shrinkage-thresholding algorithm (FISTA) (17), where the thresholding parameter λ strongly affects the number of iterations needed for convergence. A higher λ filters out more background and leads to faster convergence, but can also lead to undetected localisations. A smaller λ filters out less background and can lead to the detection of false positives. In classical approaches, λ is set globally depending on the noise level. Since our method works with ROIs, an appropriate λ can be estimated from the local noise level. We implemented a classical convolutional neural network (CNN) consisting of three convolutional layers followed by three dense layers, which predicts a specific λ for each crop (supplementary Fig. 7). Constraining this part of the network is challenging, as the network easily loses its gradient, either by converging to a high λ, resulting in a zero output, or by converging to zero, resulting in no benefit from the compressed sensing operation. Therefore, it is crucial to regularize this step and to use a suitable layer initialization. The λ estimation part of our network is followed by a sigmoid activation multiplied by 0.025, which corresponds to the maximal λ for a noiseless image that does not result in a zero output. Dense layers are initialized from a random normal distribution with mean µ = 0.5 and standard deviation σ = 0.3, with truncated-normal bias initialization. We further implemented a functional test that displays the output of the first inception layer together with the estimated λ and the CS-reconstructed sub-lattice image to monitor the λ estimation process.
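The role of λ in FISTA can be illustrated with a plain NumPy version of the standard algorithm for the objective min_x ½‖Ax − b‖² + λ‖x‖₁, together with the bounded sigmoid head described above. In the network, A would map the super-resolved sub-lattice to the measured crop; here A is simply any matrix, and the objective form and the function names are our assumptions rather than the exact layer used in this work.

```python
import numpy as np

LAMBDA_MAX = 0.025  # upper bound for lambda quoted in the text

def lambda_from_logit(z):
    """Map the raw output of the lambda-estimation CNN to (0, LAMBDA_MAX)."""
    return LAMBDA_MAX / (1.0 + np.exp(-z))

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def fista(A, b, lam, n_iter=100):
    """FISTA for min_x 0.5 * ||A x - b||^2 + lam * ||x||_1."""
    L = np.linalg.norm(A, 2) ** 2                    # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    y, t = x.copy(), 1.0
    for _ in range(n_iter):
        x_new = soft_threshold(y - A.T @ (A @ y - b) / L, lam / L)  # proximal gradient step
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x_new + ((t - 1.0) / t_new) * (x_new - x)              # momentum step
        x, t = x_new, t_new
    return x

# Usage: with A = I the solution is soft_threshold(b, lam); larger lam zeroes more entries.
b = np.array([0.3, -1.2, 0.05, 2.0])
x = fista(np.eye(4), b, lam=0.5)
```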

Network architecture. To combine CS and artificial intelligence, we implemented and evaluated the following network architectures. CS CNN. Our first approach was to use a simple CNN, as shown in Figure 2a. We used the aforementioned FISTA layer as a prior, downsampled the sub-lattice back to the input dimensions, concatenated the original input image, and applied a set of convolutions to generate the eight feature maps described above.
CS Inception. In a more sophisticated approach (Fig. 2b), we integrated CS more deeply into the network rather than using it only as a prior. We built an architecture (Fig. 7) similar to Inception (18). The idea is to run a first inception layer with a very low CS iteration count as a prior for a second inception layer with a higher iteration count. While inception layer 1 can focus on improving the image quality, inception layer 2 can reconstruct coordinates with a lower error rate, i.e., it can use a higher λ, resulting in faster convergence. The output of inception layer 2 is processed in a convolutional path similar to the first approach, producing eight feature maps in the original image dimensions.
CS U-Net. The U-Net (19) is a widely used neural network architecture for image-to-image tasks and was used, for example, in DECODE (7). In the down-sampling path, the spatial dimension of the input is reduced step by step, ultimately resulting in a dense feature space. In the subsequent up-sampling path, this dense information is combined with the corresponding layers of the down-sampling path, which contain spatial information. As in the CS CNN, the FISTA layer serves as a prior for the U-Net (Fig. 2c).
Recursive U-Net. A recent, promising approach replaced the high iteration count of classical compressed sensing with ReLU-activated convolutional layers (20). There, the feature space is connected to the sub-lattice via downsampling layers and is updated with additional details in each iteration: the current estimate is computed as x(t + 1) = BN(x(t) + x(0) + update), where BN denotes batch normalization, similar to ResNet (21). Since it is difficult to constrain the output of a convolutional network to be sparse, and our aim is to reconstruct a feature space in the original image size that predicts the coordinate center pixel and offset, we propose an iterative encoding and decoding between feature and image space. The model is shown in Figure 2d. In an initial step, we compute a first estimate of the feature space F(0). This estimate is updated as F(N + 1) = F(N) + F_update by mapping the feature space to image space, adding the noise estimate F_bg, calculating the element-wise difference to the original image, and subsequently encoding the obtained deviation back to feature space for element-wise addition to the previous estimate.
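The update rule F(N + 1) = F(N) + F_update can be sketched as a generic residual-refinement loop. In the Rec U-Net, the mappings between feature and image space are learned convolutional paths; in this sketch they are replaced by hypothetical linear stand-ins (a matrix D and the scaled transpose ηDᵀ), which turns the loop into a classical Landweber iteration and makes its convergence easy to check. Batch normalization is omitted.

```python
import numpy as np

def recursive_refinement(image, decode, encode, f0, background, n_steps):
    """F(N+1) = F(N) + encode(image - (decode(F(N)) + background))."""
    F = f0
    for _ in range(n_steps):
        residual = image - (decode(F) + background)  # deviation in image space
        F = F + encode(residual)                     # correction in feature space
    return F

# Usage with hypothetical linear stand-ins for the learned mappings.
rng = np.random.default_rng(0)
D = rng.normal(size=(40, 10))                 # feature space -> image space
eta = 1.0 / np.linalg.norm(D, 2) ** 2         # step size that keeps the update stable
F_true = rng.normal(size=10)
bg = 0.5 * np.ones(40)                        # background estimate (F_bg)
image = D @ F_true + bg
F_hat = recursive_refinement(image, lambda F: D @ F, lambda r: eta * D.T @ r,
                             np.zeros(10), bg, n_steps=3000)
```

With learned, nonlinear mappings the loop keeps the same structure, but only a small, fixed number of refinement steps is unrolled and trained end to end.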

Activation.
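A detailed visualization of the activations is shown in supplementary Figure 6. We used a sigmoid function on the output slice of the classifier image to map each output pixel to a probability p ∈ [0, 1]. We constrained σ_x, σ_y and σ_N to [0, 3] with a sigmoid activation multiplied by three, to limit the standard deviation to a reasonable interval. A hyperbolic tangent activation was applied to the subpixel coordinates ∆x and ∆y to clip these values to the range [−1, 1]. This is important to maintain the advantages of local reconstruction while neglecting localisations beyond the local context. We also considered softmax as the activation function for the classifier image. Despite a higher learning rate, this approach has major drawbacks: because softmax normalizes the outputs to sum to one over the crop, they always cover the full range p ∈ [0, 1]. This gives rise to false positives if there is no localisation within the observed region, or false negatives if more than one active localisation is present.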
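The output constraints described above can be sketched as a mapping from the eight raw output maps to bounded quantities. The channel order is hypothetical, and the text does not state which activations are used for N and B, so they are left linear here as an assumption.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical channel order of the 8 raw output maps (not specified in the text).
CHANNELS = ("p", "dx", "dy", "N", "sigma_x", "sigma_y", "sigma_N", "B")

def constrain_outputs(raw):
    """raw: array (8, H, W) of unconstrained network outputs -> dict of constrained maps."""
    maps = dict(zip(CHANNELS, raw))
    out = {
        "p": sigmoid(maps["p"]),             # detection probability in (0, 1)
        "dx": np.tanh(maps["dx"]),           # sub-pixel offsets in (-1, 1)
        "dy": np.tanh(maps["dy"]),
    }
    for k in ("sigma_x", "sigma_y", "sigma_N"):
        out[k] = 3.0 * sigmoid(maps[k])      # uncertainties in (0, 3)
    out["N"], out["B"] = maps["N"], maps["B"]  # left linear here (assumption)
    return out

# Usage on random raw outputs for a 9 x 9 crop.
raw = np.random.default_rng(0).normal(scale=5.0, size=(8, 9, 9))
out = constrain_outputs(raw)
```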

Loss function.
The loss function of our network is composed of several components and was adapted from DECODE (7). We implemented a localisation loss for the predicted emitters, a count loss for an accurate number of localisations that also drives the prediction probabilities towards zero or one, and a background loss for the predicted noise level. The localisation loss is a Gaussian mixture model built from the probability p_i of every pixel i to contain a localisation, the position of that localisation x_i = x_px + ∆x, y_i = y_px + ∆y, where x_px, y_px denote the coordinates of the current pixel and ∆x, ∆y the predicted offsets, the estimated intensity N_i, and the estimated uncertainty σ of each variable. It is evaluated at the ground truth coordinates x_t, y_t and the ground truth intensity N_t. In the count loss, σ_c = Σ_i p_i (1 − p_i) encourages results close to 0 and 1 and therefore reduces uncertainty. The background loss compares the predicted background B_i with the noiseless ground truth N_i. The total loss is the sum of the individual loss functions.
Model Evaluation. To evaluate the performance of our approach, we use the root mean squared error (RMSE) and the Jaccard index JI = TP / (TP + FP + FN), where TP are the true positives, FP the false positives and FN the false negatives. For real datasets with unknown ground truth, we used the Fourier Ring Correlation (FRC) as proposed in (12, 22).
Training Procedure. Using our simulation, we created a dataset consisting of 40 batches, each containing 4 × 1000 crops. Three of these sub-batches are used for training and one for evaluation. While the noise is simulated completely randomly for each crop, σ differs between batches and lies in the range σ ∈ [175, 185]. The network is trained for 150 iterations followed by one evaluation cycle, in which we compute the JI, the RMSE and a validation loss. We used the Adam optimizer with a learning rate of 10⁻⁴. Neural networks were implemented in TensorFlow 2 and trained on an NVIDIA GTX 1080 Ti GPU.
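The localisation and count terms described above follow DECODE (7). The sketch below implements them in the form published with DECODE: a mixture of Gaussians over pixels, weighted by p_i / Σ_j p_j and evaluated at each ground-truth emitter, plus a Gaussian approximation of the emitter count with mean Σ_i p_i and variance Σ_i p_i(1 − p_i). It is a sketch under these assumptions, not the exact loss used here: the array layout, the averaging over emitters and the epsilon terms are our choices, and the background term is omitted.

```python
import numpy as np

def gmm_localisation_loss(p, mu, sigma, truth, eps=1e-12):
    """DECODE-style mixture loss (sketch).
    p: (K,) pixel probabilities, mu: (K, 3) predicted (x, y, N),
    sigma: (K, 3) predicted uncertainties, truth: (E, 3) ground truth (x_t, y_t, N_t)."""
    pc = np.clip(p, eps, None)
    log_w = np.log(pc / pc.sum())                                    # mixture weights
    z = (truth[:, None, :] - mu[None, :, :]) / sigma[None, :, :]     # (E, K, 3)
    log_gauss = (-0.5 * z ** 2 - np.log(sigma[None]) - 0.5 * np.log(2 * np.pi)).sum(-1)
    log_comp = log_gauss + log_w[None, :]                            # (E, K)
    m = log_comp.max(axis=1, keepdims=True)                          # stable log-sum-exp
    log_lik = m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1))
    return -log_lik.mean()

def count_loss(p, n_true, eps=1e-12):
    """Negative log of a Gaussian approximation to the emitter count:
    mean sum(p), variance sum(p * (1 - p)); a small variance favours p near 0 or 1."""
    mean = p.sum()
    var = (p * (1.0 - p)).sum() + eps
    return 0.5 * (n_true - mean) ** 2 / var + 0.5 * np.log(2 * np.pi * var)

# Usage: three candidate pixels, one ground-truth emitter.
p = np.array([0.9, 0.05, 0.05])
mu = np.array([[1.0, 2.0, 1000.0], [5.0, 5.0, 500.0], [7.0, 7.0, 800.0]])
sigma = np.tile([0.1, 0.1, 50.0], (3, 1))
loss_good = gmm_localisation_loss(p, mu, sigma, np.array([[1.0, 2.0, 1000.0]]))
loss_bad = gmm_localisation_loss(p, mu, sigma, np.array([[1.5, 2.5, 1000.0]]))
```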

Ground truth from simulated FLIMbee experiments.
The first step before training a deep neural network is to obtain suitable ground truth. Training data can either be obtained from experiments and labeled by hand or by existing algorithms, or it can be generated using simulations. Since dSTORM FLIMbee data are difficult to acquire experimentally and the performance of classical reconstruction algorithms on them is limited, we chose the latter and developed a computer simulation of the FLIMbee measurement process. Using this tool, we simulated a dataset consisting of 40 batches, each containing 4 × 1000 crops. Three of these sub-batches were used for training and one for evaluation.
Trainable wavelet filter to find regions of interest. Reconstructing SMLM data requires a lot of computational power, as each super-resolved image is composed of millions of localisations retrieved from thousands of frames. On top of that, compressed sensing algorithms are computationally expensive: the size of the reconstruction matrix, and therefore the reconstruction time, scales with the fourth power of the image side length m, i.e., with m⁴. We reduce the dimension of the reconstruction problem by implementing a differentiable wavelet filter bank that is trained to search for frequencies resembling a PSF. ROIs are then cropped to a 9 × 9 × 3 area around the detected maximum, taking the temporal context of the previous and subsequent frame into account.
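To make the m⁴ scaling concrete, the sketch below counts the entries of a dense CS matrix that maps an (f·m) × (f·m) sub-lattice to an m × m measurement. The upsampling factor f = 8 is a hypothetical value chosen for illustration. Going from a 9 × 9 crop to a 45 × 45 frame (the frame size used in Table 1) multiplies the matrix size by (45/9)⁴ = 625, from roughly 1.7 MB to about 1 GB in float32.

```python
def cs_matrix_entries(m, upsample):
    """Entries of a dense CS matrix for an m x m measurement and an
    (upsample*m) x (upsample*m) sub-lattice: (m*m) * (upsample*m)**2, i.e. ~ m**4."""
    return (m * m) * (upsample * m) ** 2

UPSAMPLE = 8  # hypothetical super-resolution factor
for m in (9, 45):
    n = cs_matrix_entries(m, UPSAMPLE)
    print(f"{m:>2} x {m:<2} px: {n:>11,d} entries, {4 * n / 1e6:8.1f} MB as float32")
```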

Deep neural networks with CS.
To combine the advantages of CS with the benefits of artificial intelligence, we implemented and trained four different network architectures. The first approach was a simple CNN using the CS layer as a prior (Fig. 2a). In the second approach, we used an Inception-like architecture (18) combined with CS. This network has two stages with different iteration counts and values of λ (Fig. 2b). The third approach combines a classical U-Net architecture with CS (Fig. 2c), and the fourth network is a recursive U-Net-like architecture mimicking the iterative structure of CS (Fig. 2d). In all cases, the input layer has a size of 9 × 9 × 3 pixels and takes the ROIs identified by the trainable wavelet transform. To further constrain the compressed sensing part of our network, we tested an additional loss term for the CS layer to encourage a sparse reconstruction. For this term, we track the normalized compressed sensing output b of the inception layers and penalize entries differing from zero using the L1 loss. To prevent the compressed sensing part from collapsing to zero, we apply the convolution matrix A to the CS output: s = Ab. In the case of an optimal reconstruction, this operation convolves a sparse sub-lattice with the measurement function while downsampling to the original image size. The result s is a denoised version of the original image. The loss is then computed as the squared difference between s and the noiseless training data. Since this did not improve the results significantly, we discarded the loss term in our final network versions.
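The discarded auxiliary term can be sketched as an L1 penalty on the CS output b plus the squared error between s = Ab and the noiseless image. The relative weight `l1_weight` is a hypothetical parameter, b is assumed to be already normalized and flattened, and the toy A below simply sums pairs of sub-lattice entries (a 1D downsampling by two) rather than convolving with a real PSF.

```python
import numpy as np

def sparsity_aux_loss(b, A, clean, l1_weight=1.0):
    """Auxiliary CS loss sketch: L1 sparsity on b plus squared error of s = A b
    against the noiseless image. b: flattened sub-lattice, A: (pixels, sub-lattice)."""
    s = A @ b                                   # re-blur and downsample the sparse estimate
    return l1_weight * np.abs(b).sum() + np.sum((s - clean) ** 2)

# Usage: toy measurement operator summing pairs of sub-lattice entries.
A = np.kron(np.eye(4), np.ones((1, 2)))         # shape (4, 8)
b = np.array([0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0])
clean = np.array([1.0, 0.0, 0.0, 2.0])
loss = sparsity_aux_loss(b, A, clean)           # only the L1 part remains: 3.0
```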
Comparison of network architectures. We compared the performance of the four network architectures on a separate simulated validation dataset. We computed the RMSE, JI and validation loss every 150 steps on an independent test batch during training (Fig. 3). The recursive U-Net achieved the best results of all methods. CS in the form of FISTA is a solid prior for the Inception-like network, leading to an early increase in the metrics. For higher iterations, however, the network metrics converge. A detailed evaluation of the training and evaluation performance is shown in Table 1.
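Computing JI and RMSE requires matching predicted to ground-truth localisations first. The sketch below uses greedy nearest-pair matching within a distance tolerance; both the greedy strategy and the tolerance value are our assumptions (an optimal assignment, e.g. the Hungarian algorithm, is a common alternative), and the RMSE is taken over matched pairs only.

```python
import math

def jaccard_and_rmse(pred, truth, tolerance):
    """Greedy nearest-pair matching within `tolerance`, then JI and lateral RMSE."""
    pairs = sorted(
        (math.dist(p, t), i, j)
        for i, p in enumerate(pred)
        for j, t in enumerate(truth)
    )
    used_pred, used_truth, sq_err = set(), set(), []
    for d, i, j in pairs:
        if d > tolerance:
            break                            # all remaining pairs are farther away
        if i in used_pred or j in used_truth:
            continue
        used_pred.add(i)
        used_truth.add(j)
        sq_err.append(d * d)
    tp = len(sq_err)
    fp, fn = len(pred) - tp, len(truth) - tp
    ji = tp / (tp + fp + fn) if (tp + fp + fn) else 1.0
    rmse = math.sqrt(sum(sq_err) / tp) if tp else float("nan")
    return ji, rmse

# Usage (coordinates in nm): two matches, one false positive, one false negative.
pred = [(0.0, 0.0), (10.0, 10.0), (100.0, 100.0)]
truth = [(3.0, 4.0), (10.0, 10.0), (500.0, 500.0)]
ji, rmse = jaccard_and_rmse(pred, truth, tolerance=50.0)
```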
Application to experimental data. We performed experiments with a FLIMbee galvanometric scanner as described in the Methods (Fig. 4) and tested the trained networks on these real data. The raw data show the typical interrupted PSFs as well as a varying intensity between lines. Having been trained on these nonlinearities, our network is able to precisely predict the centers of these localisations. To assess the quality of prediction on experimental data, we used FRC and LineProfiler, as objective ground truth was not available in this case. The results (CS Inception: 0.211, Rec U-Net: 0.265) indicate an improved reconstruction quality compared to classical fitters such as ThunderSTORM (0.167). Note that an additional drift correction improves the quality of the reconstruction significantly. For the image in Fig. 4, we applied a linear drift correction. For the other images in Fig. 5, we used the ThunderSTORM RCC drift correction.
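The FRC values compare reconstructions rendered from two halves of the localisation data (Fig. 5). A minimal FRC computation can be sketched as follows; rendering by 2D histogramming, integer ring binning and splitting into first and second half are our assumptions, and the localisation table in the usage lines is synthetic.

```python
import numpy as np

def render(locs, n_px, extent):
    """Histogram (x, y) localisations into an n_px x n_px image covering [0, extent)."""
    img, _, _ = np.histogram2d(locs[:, 1], locs[:, 0], bins=n_px,
                               range=[[0, extent], [0, extent]])
    return img                                   # rows = y bins, columns = x bins

def frc_curve(img1, img2):
    """Fourier ring correlation of two square images, one value per integer ring radius."""
    f1 = np.fft.fftshift(np.fft.fft2(img1))
    f2 = np.fft.fftshift(np.fft.fft2(img2))
    n = img1.shape[0]
    yy, xx = np.indices(img1.shape)
    ring = np.hypot(xx - n // 2, yy - n // 2).astype(int).ravel()
    num = np.bincount(ring, weights=(f1 * np.conj(f2)).real.ravel())
    p1 = np.bincount(ring, weights=(np.abs(f1) ** 2).ravel())
    p2 = np.bincount(ring, weights=(np.abs(f2) ** 2).ravel())
    return num / np.sqrt(p1 * p2 + 1e-30)

# Usage: split a (synthetic) localisation table into halves and correlate their renderings.
rng = np.random.default_rng(1)
locs = rng.uniform(0, 1000, size=(20000, 2))     # hypothetical (x, y) in nm
half = len(locs) // 2
curve = frc_curve(render(locs[:half], 64, 1000.0), render(locs[half:], 64, 1000.0))
```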

Discussion
We developed a robust data simulator for FLIMbee SMLM measurements that combines the method-specific disrupted PSFs with accurate noise simulations. We furthermore introduced a learnable wavelet filter that can be trained to accurately detect emitters and crop them to speed up the evaluation pipeline. Finally, we implemented and evaluated different approaches to integrate compressed sensing operations into deep neural networks for the reconstruction of super-resolution images from nonlinear, disrupted PSFs. The Rec U-Net architecture achieved the best JI and RMSE on simulated data as well as the best FRC score on real data, while architectures with CS-like sparse representations did not perform as well. This indicates that sparse representations might not be optimal for the learning process of neural networks. A possible reason may be the large number of zero values in the sparse representation, leading to vanishing gradients for large fractions of the feature space. Neural networks might be better suited to create a parameterized representation of the sparse sub-domain, like compressed sparse row (23), or to directly compute the feature space representation as proposed in the Rec U-Net. The trainable wavelet filter is an efficient way to identify regions of interest in SMLM data. Trained on realistic simulated data, it is able to filter out background frequencies and to accurately determine regions of interest. Since sparsely activated emitters leave most of each frame empty, preselecting regions of interest has several advantages over reconstructing a whole image. The subsequent reconstruction network is scalable, since the reconstructed regions of interest always have identical size. On top of that, training duration and network depth can be reduced drastically. However, there are cases where prefiltering reaches its limits or even introduces disadvantages. High-density samples pose a problem, since emitters overlap and bear less resemblance to the original PSF. Data that diverge too much from the original training data can also lead to a loss of localisations. For the given problem of low-density emitters with disrupted PSFs, however, it is an efficient way to identify regions of interest.
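To illustrate what a parameterized representation of a sparse sub-lattice such as compressed sparse row (CSR) looks like, the sketch below converts a dense array into the three CSR arrays (values, column indices, row pointers). It is a generic textbook construction for illustration only, not part of the networks evaluated here.

```python
def to_csr(dense):
    """Convert a dense 2D list into CSR arrays (values, column indices, row pointers)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))     # row i occupies values[row_ptr[i]:row_ptr[i+1]]
    return values, col_idx, row_ptr

# Usage: a tiny sub-lattice with two emitters is described by a handful of numbers.
values, col_idx, row_ptr = to_csr([[0, 0, 3], [0, 0, 0], [5, 0, 0]])
```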
As stated in (7), inaccurately estimated localisations tend to be reconstructed towards the center of a feature space pixel. This can be mitigated by adjusting the precision threshold and/or the reconstruction method (local maxima). Interestingly, this effect was also observed for classical compressed sensing methods such as (24) and seems to be a general problem of discrete feature spaces. The fact that the metrics converge at high iterations may be caused by the linear convergence of L1 minimization algorithms, as the information available before full convergence is limited. Another possible explanation is that the training process is able to extract the necessary information even with low iteration counts. Interestingly, the best results in terms of JI and RMSE do not coincide with the best validation loss. Possible explanations of this behavior include overfitting or local minima with strongly underestimated localisation uncertainty.

As can be seen from Table 1, the initial FISTA layer introduces a large computational cost, making the Rec U-Net approach much faster than the CS-like implementations. This architecture resembles the unfolding of CS iterations in a deep neural network, as proposed by Gregor and LeCun (25). This approach has already been applied to SOFI super-resolution imaging using sparsity in the correlation domain (26), by unrolling the iterative FISTA compressed sensing algorithm into a deep neural network. Our results confirm that algorithm unfolding is an efficient method to combine the advantages of iterative compressed sensing methods with deep learning for the reconstruction of high-resolution microscopy data, and it will likely find many other applications in the future.
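Algorithm unfolding in the sense of Gregor and LeCun (LISTA) can be sketched by writing each ISTA iteration as a layer z ← soft(W_e b + S z, θ). With the initialization W_e = Aᵀ/L, S = I − AᵀA/L and θ = λ/L, a K-layer network reproduces K ISTA iterations exactly; training then adapts W_e, S and θ. The sketch unrolls plain ISTA rather than FISTA, and the names are ours.

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lista_forward(b, W_e, S, theta, n_layers):
    """Unrolled ISTA (LISTA): each layer computes z <- soft(W_e b + S z, theta).
    In LISTA, W_e, S and theta are trainable; here they are plain arrays."""
    z = soft_threshold(W_e @ b, theta)
    for _ in range(n_layers - 1):
        z = soft_threshold(W_e @ b + S @ z, theta)
    return z

def ista_initialisation(A, lam):
    """Parameters for which LISTA reproduces ISTA on min 0.5||Az - b||^2 + lam||z||_1."""
    L = np.linalg.norm(A, 2) ** 2
    return A.T / L, np.eye(A.shape[1]) - (A.T @ A) / L, lam / L

def ista(A, b, lam, n_iter):
    L = np.linalg.norm(A, 2) ** 2
    z = np.zeros(A.shape[1])
    for _ in range(n_iter):
        z = soft_threshold(z - A.T @ (A @ z - b) / L, lam / L)
    return z

# Usage: 10 unrolled layers equal 10 ISTA iterations before any training.
rng = np.random.default_rng(0)
A = rng.normal(size=(16, 32))
b = rng.normal(size=16)
W_e, S, theta = ista_initialisation(A, lam=0.1)
z_lista = lista_forward(b, W_e, S, theta, n_layers=10)
```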
The current state of the art for fitting classical localization data according to the SMLM challenge (10) is DECODE (7). In this work, three independent U-Nets were applied to three consecutive frames to detect localizations. A direct comparison of our approach to DECODE was not possible, since its training process is tightly coupled to a simulation of frames with a spline-parameterized PSF that is not compatible with our confocal dSTORM data. If the training process could be adapted to incorporate our simulator, it would be interesting to see whether it performs as well as our CS U-Net-based approach, since we adapted our loss function and output format from DECODE.

Conclusions
We developed a data generator for nonlinear PSFs in the context of super-resolved confocal lifetime imaging and were able to reconstruct localisations with improved accuracy compared to classical fitters by developing and training an artificial neural network. In addition to an improvement in computation time, we demonstrate the adaptation of compressed sensing to deep neural networks for reconstructing nonlinearly varying PSFs. Our results indicate that using a deep architecture like Inception is beneficial to the model's performance.
Including local context by reconstructing to the original crop size, as well as including the temporal context of the previous and subsequent frame, improves the reconstruction quality significantly. Implementing compressed sensing in artificial neural networks is a promising concept, but further work is needed to improve the implementation details.
For an optimal solution, the CS part should fully converge. This is, however, computationally demanding, since every iteration adds a nontrivial operation that the network has to back-propagate through. In comparison, algorithm unfolding appears to be a more efficient way to integrate compressed sensing and deep learning.

Fig. 1. Confocal dSTORM data acquisition process and data simulation. (a) The pulsed 640 nm excitation light is converted to radial polarisation with a quarter wave plate (QWP) after passing through a single mode fibre (SMF), reflected by a beam splitter (BS) into a galvanometric laser scanner and focused by an oil immersion objective. The collected fluorescent emission from the sample is descanned, passed through the BS, reflected by mirrors (M) and focused onto the pinhole (PH), then onto the single-photon avalanche photodiodes (SPAD) using lenses (L1, L2, L3 and L4). Band pass filters (BP) block scattered excitation light and prevent afterglow effects of the detectors. (b) The galvo scanner is a one-pixel scanner rastering the image line by line. At time t, only the blue marked part of S is active, representing the horizontal acquisition line of the FLIMbee detector. The acquisition speed in x-direction is sufficiently fast to be neglected during simulations. Only active fluorophores overlapping the active part of S are rendered (green).

Fig. 2. Network models. Image data containing the temporal context of the previous and subsequent frame are processed by different network models. (a) CS CNN uses compressed sensing as a prior and applies several convolutional layers. (b) CS Inception integrates the CS component deeper into the neural network. (c) CS U-Net uses compressed sensing as a prior and computes the feature space with a U-Net architecture. (d) Rec U-Net aims to unroll the CS algorithm with iterative encoding and decoding from image to feature space and vice versa. For all network models, the feature space is processed with sigmoid and tanh activations and fed into a Gaussian mixture model to compute the loss.

Fig. 3. Comparison of different network architectures. (a) Validation loss over training steps. (b) Jaccard index and RMSE of the tested models.

Fig. 4. Reconstruction of FLIMbee dSTORM microtubules. (a) Fitting of disrupted PSFs with artificial intelligence. Red crosses denote the estimated location of a fluorophore. (b) Reconstructed super-resolution image. Scale bar = 10 µm. (c) Line profile of the microtubule marked blue in (b).

Fig. 5 .Fig. 6 .
Fig. 5. Comparison of FLIM data evaluated with ThunderSTORM and our method. Reconstructed images of AI (left) and ThunderSTORM (right). Comparing the first and the second half of the localisation data, we obtain a Fourier Ring Correlation coefficient (12) of 0.310; 0.187; 0.265 (left) and 0.179; 0.187; 0.167 (right), respectively. Scale bar = 10 µm.

Fig. 7 .Fig. 8 .
Fig. 7. Inception building block. This building block is derived from the Inception network (18). The input is processed in four different paths (from left to right). We estimate the compressed sensing parameter λ with a conventional CNN. The input is processed by a bottleneck layer, followed by the compressed sensing layer. Several convolutional layers restore the original image size. A feature detector applies asymmetric filters from different directions. A pass-through only applies an activation function, passing forward the original image.

Table 1. Training and inference time of different network architectures on an NVIDIA GTX 1080 Ti. Training times are measured per epoch. For the evaluation of inference time, a FLIMbee dataset with 4500 frames of 45 × 45 px is used.